Load the dataset

The collected dataset contains information about houses on sale in the Dublin area. Each house is one entry in the dataset: a mixed-type record comprising numerical, categorical, and textual data.

The goal is to combine the numerical and categorical features with the textual features to predict the house price.

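The house price is determined by factors such as

* location (area),
* surface (size),
* the number of bedrooms,
* the number of bathrooms,
* property type,
* house-features (size of the windows, construction material).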

The physical attributes of the house, such as the number of bedrooms, the number of bathrooms, the surface, the property type, and the location, are directly accessible from the dataset. The house-features, by contrast, can be inferred (sometimes only indirectly) from the house-description, house-facility and house-features fields. Some typical entries from the dataset are shown below.

data <- read.csv(file = 'train.csv', sep = ",")
data[10:28, 3:17]  # rows 10 to 28; there are 17 columns, and the first two are just ids

Data Cleaning, Covariate selection and preprocessing

We select three columns (‘bathrooms’, ‘beds’, ‘surface’) to use as predictors for price.

datasel <- data[c('bathrooms', 'beds', 'surface', 'price')]
datasel <- na.omit(datasel)  # remove every row that contains a missing value (NA)
datasel
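
Since na.omit() silently drops rows, it can help to check how many values are missing in each column and how many rows are lost. The sketch below uses a small made-up data frame (`toy`, not the real train.csv) so that it runs on its own; the same calls should apply to `datasel`.

```r
# Hypothetical toy data standing in for the columns selected from train.csv
toy <- data.frame(
  bathrooms = c(1, 2, NA, 3),
  beds      = c(2, 3, 3, NA),
  surface   = c(55, 80, 120, NA),
  price     = c(250000, 400000, 520000, 700000)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows removed by na.omit()
toy_clean <- na.omit(toy)
rows_dropped <- nrow(toy) - nrow(toy_clean)
print(rows_dropped)
```

If a large share of the rows is dropped, imputing the missing values may be worth considering instead of discarding them.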

Linear regression

We now fit a linear regression of price on the three selected predictors.

model <- lm(price ~ bathrooms + beds + surface, data = datasel)
summary(model)

Call:
lm(formula = price ~ bathrooms + beds + surface, data = datasel)

Residuals:
     Min       1Q   Median       3Q      Max 
-3228439  -195188   -56583    79729  7778095 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.288e+05  2.437e+04  -5.283 1.39e-07 ***
bathrooms    1.361e+05  1.178e+04  11.557  < 2e-16 ***
beds         1.392e+05  1.030e+04  13.515  < 2e-16 ***
surface      7.898e+00  2.323e+00   3.400 0.000684 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 501600 on 2398 degrees of freedom
Multiple R-squared:  0.3044,    Adjusted R-squared:  0.3035 
F-statistic: 349.8 on 3 and 2398 DF,  p-value: < 2.2e-16

Is this a good model? Can we use other columns in data to improve the model? Can we include polynomial and/or interaction terms to improve the model? Use the model selection approaches you learned in session 5 to find a better model.
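
As a starting point for these questions, here is a minimal sketch of how a richer model can be compared with the baseline. It runs on simulated data (the data frame `sim` and its data-generating process are assumptions made for illustration, not the Dublin dataset), and it uses AIC and a nested-model F-test, which may or may not be the approaches covered in session 5.

```r
set.seed(1)  # reproducible simulated data, purely illustrative
n <- 200
sim <- data.frame(
  bathrooms = sample(1:4, n, replace = TRUE),
  beds      = sample(1:5, n, replace = TRUE),
  surface   = runif(n, 40, 250)
)
# Assumed data-generating process with a beds-by-bathrooms interaction
sim$price <- 50000 + 30000 * sim$beds + 20000 * sim$bathrooms +
  15000 * sim$beds * sim$bathrooms + 1000 * sim$surface +
  rnorm(n, sd = 20000)

# Baseline: main effects only, as in the model fitted above
base_model <- lm(price ~ bathrooms + beds + surface, data = sim)
# Richer model: interaction term plus a quadratic term in surface
rich_model <- lm(price ~ bathrooms * beds + poly(surface, 2), data = sim)

# Lower AIC suggests a better trade-off between fit and complexity
aic_table <- AIC(base_model, rich_model)
print(aic_table)

# F-test for nested models: does the richer model fit significantly better?
print(anova(base_model, rich_model))
```

On the real data the same comparison would use `datasel` (or a wider selection of columns) in place of `sim`. A lower AIC on the training data does not guarantee better predictions on unseen data, so a held-out split or cross-validation is a useful complement.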

Unseen data

You can test the predictive performance of your best model on unseen data.

datatest <- read.csv(file = 'test.csv', sep = ",")
datatest[10:28, 3:16]  # rows 10 to 28; there are 16 columns, and the first two are just ids. The price column is not reported: you have to predict the price for every entry in the test set

Prediction

predictions <- predict(model,datatest)
predictions[1:5]
       1        2        3        4        5 
701423.2 561985.5 837755.7 834321.9 425684.7 

These are the predicted prices for the first five houses in the test set. You can save your best predictions and submit them to our internal data science competition with the following code:

write.csv(predictions,"name_surname.csv")
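
One thing worth checking before submitting: by default, predict() returns NA for any row of the new data with a missing predictor, so the saved file may contain missing predictions. The sketch below shows one possible workaround, filling missing predictors with the training medians. The data frames `train_toy` and `test_toy` are made up for illustration, and median imputation is an assumption, not a method prescribed by the exercise.

```r
# Hypothetical toy training and test sets (not the real train.csv / test.csv)
train_toy <- data.frame(
  bathrooms = c(1, 2, 2, 3, 1, 4),
  beds      = c(2, 3, 4, 4, 1, 5),
  surface   = c(50, 85, 110, 140, 38, 210),
  price     = c(240000, 390000, 450000, 610000, 200000, 900000)
)
test_toy <- data.frame(
  bathrooms = c(2, NA),
  beds      = c(3, 2),
  surface   = c(90, NA)
)
toy_model <- lm(price ~ bathrooms + beds + surface, data = train_toy)

# The second test row has missing predictors, so its prediction is NA
raw_pred <- predict(toy_model, test_toy)
print(raw_pred)

# Fill each missing predictor with the median of that column in the training set
test_filled <- test_toy
for (col in c("bathrooms", "beds", "surface")) {
  missing <- is.na(test_filled[[col]])
  test_filled[[col]][missing] <- median(train_toy[[col]], na.rm = TRUE)
}
filled_pred <- predict(toy_model, test_filled)
print(filled_pred)
```

Calling `sum(is.na(predictions))` on the real predictions shows how many entries are affected.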

We will use the mean absolute percentage error (MAPE) to evaluate the accuracy of your predictions.
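
For reference, MAPE is commonly defined as the mean of |actual - predicted| / |actual|, often multiplied by 100 to express it as a percentage. A minimal sketch follows; the function name `mape` and the example prices are hypothetical, and whether the competition reports a fraction or a percentage is not stated here.

```r
# MAPE in percent: mean of |actual - predicted| / |actual|, times 100
mape <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  mean(abs((actual - predicted) / actual)) * 100
}

# Hypothetical prices, for illustration only
actual    <- c(200000, 400000, 500000)
predicted <- c(220000, 380000, 500000)
mape_value <- mape(actual, predicted)
print(mape_value)
```

Because the test prices are not provided, one way to estimate MAPE yourself is to hold out part of the training data, fit the model on the rest, and apply the function to the held-out prices.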
